The ISB-CGC open-access TCGA tables in BigQuery

The goal of this notebook is to introduce you to a new publicly available, open-access dataset in BigQuery. This set of BigQuery tables was produced by the ISB-CGC project, based on the open-access TCGA data available at the TCGA Data Portal. You will need access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free trial or contact us and become part of the community evaluation phase of our Cancer Genomics Cloud pilot. (You can find more information about this NCI-funded program here.)

We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are links to some resources that you might find useful:

  • BigQuery,
  • the BigQuery web UI where you can run queries interactively,
  • IPython (now known as Jupyter), and
  • Cloud Datalab, the recently announced interactive cloud-based platform on which this notebook is being developed.

There are also many tutorials and samples available on GitHub (see, in particular, the datalab repo and the Google Genomics project).

In order to work with BigQuery, the first thing you need to do is import the gcp.bigquery package:


In [6]:
import gcp.bigquery as bq

The next thing you need to know is how to access the specific tables you are interested in. BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we are introducing in this notebook are in a dataset called tcga_201607_beta, owned by the isb-cgc project. A full table identifier is of the form <project_id>:<dataset_id>.<table_id>. Let's start by getting some basic information about the tables in this dataset:


In [7]:
d = bq.DataSet('isb-cgc:tcga_201607_beta')
for t in d.tables():
  print('%10d rows  %12d bytes   %s'
        % (t.metadata.rows, t.metadata.size, t.name.table_id))


      6322 rows       1729204 bytes   Annotations
     23797 rows       6382147 bytes   Biospecimen_data
     11160 rows       4201379 bytes   Clinical_data
   2646095 rows     333774244 bytes   Copy_Number_segments
3944304319 rows  445303830985 bytes   DNA_Methylation_betas
 382335670 rows   43164264006 bytes   DNA_Methylation_chr1
 197519895 rows   22301345198 bytes   DNA_Methylation_chr10
 235823572 rows   26623975945 bytes   DNA_Methylation_chr11
 198050739 rows   22359642619 bytes   DNA_Methylation_chr12
  97301675 rows   10986815862 bytes   DNA_Methylation_chr13
 123239379 rows   13913712352 bytes   DNA_Methylation_chr14
 124566185 rows   14064712239 bytes   DNA_Methylation_chr15
 179772812 rows   20296128173 bytes   DNA_Methylation_chr16
 234003341 rows   26417830751 bytes   DNA_Methylation_chr17
  50216619 rows    5669139362 bytes   DNA_Methylation_chr18
 211386795 rows   23862583107 bytes   DNA_Methylation_chr19
 279668485 rows   31577200462 bytes   DNA_Methylation_chr2
  86858120 rows    9805923353 bytes   DNA_Methylation_chr20
  35410447 rows    3997986812 bytes   DNA_Methylation_chr21
  70676468 rows    7978947938 bytes   DNA_Methylation_chr22
 201119616 rows   22705358910 bytes   DNA_Methylation_chr3
 159148744 rows   17968482285 bytes   DNA_Methylation_chr4
 195864180 rows   22113162401 bytes   DNA_Methylation_chr5
 290275524 rows   32772371379 bytes   DNA_Methylation_chr6
 240010275 rows   27097948808 bytes   DNA_Methylation_chr7
 164810092 rows   18607886221 bytes   DNA_Methylation_chr8
  81260723 rows    9173717922 bytes   DNA_Methylation_chr9
  98082681 rows   11072059468 bytes   DNA_Methylation_chrX
   2330426 rows     263109775 bytes   DNA_Methylation_chrY
   1867233 rows     207365611 bytes   Protein_RPPA_data
   5356089 rows    5715538107 bytes   Somatic_Mutation_calls
   5738048 rows     657855993 bytes   mRNA_BCGSC_GA_RPKM
  38299138 rows    4459086535 bytes   mRNA_BCGSC_HiSeq_RPKM
  44037186 rows    5116942528 bytes   mRNA_BCGSC_RPKM
  16794358 rows    1934755686 bytes   mRNA_UNC_GA_RSEM
 211284521 rows   24942992190 bytes   mRNA_UNC_HiSeq_RSEM
 228078879 rows   26877747876 bytes   mRNA_UNC_RSEM
  11997545 rows    2000881026 bytes   miRNA_BCGSC_GA_isoform
   4503046 rows     527101917 bytes   miRNA_BCGSC_GA_mirna
  90237323 rows   15289326462 bytes   miRNA_BCGSC_HiSeq_isoform
  28207741 rows    3381212265 bytes   miRNA_BCGSC_HiSeq_mirna
 102234868 rows   17290207488 bytes   miRNA_BCGSC_isoform
  32710787 rows    3908314182 bytes   miRNA_BCGSC_mirna
  26763022 rows    3265303352 bytes   miRNA_Expression
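
The table names listed above can be combined with the project and dataset names to form full table identifiers of the form <project_id>:<dataset_id>.<table_id>. The following is a minimal sketch in plain Python (it does not call BigQuery); the helper names `format_table_identifier` and `parse_table_identifier` are hypothetical and are not part of the gcp.bigquery package.

```python
from __future__ import print_function
from collections import namedtuple

# Hypothetical container for the three parts of a full table identifier.
TableId = namedtuple('TableId', ['project_id', 'dataset_id', 'table_id'])


def format_table_identifier(project_id, dataset_id, table_id):
    """Build '<project_id>:<dataset_id>.<table_id>' from its parts."""
    return '%s:%s.%s' % (project_id, dataset_id, table_id)


def parse_table_identifier(full_id):
    """Split '<project_id>:<dataset_id>.<table_id>' into its parts."""
    # Split on the last ':' so that a project id that itself contains
    # a ':' is kept whole.
    project_id, colon, rest = full_id.rpartition(':')
    dataset_id, dot, table_id = rest.partition('.')
    if not (colon and dot and project_id and dataset_id and table_id):
        raise ValueError('not a full table identifier: %r' % (full_id,))
    return TableId(project_id, dataset_id, table_id)


full_id = format_table_identifier('isb-cgc', 'tcga_201607_beta', 'Clinical_data')
print(full_id)
print(parse_table_identifier(full_id))
```

Parsing on the last colon is a design choice made here as an assumption about how identifiers are written; adjust it if your identifiers follow a different convention.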

These tables are based on the open-access TCGA data as of July 2016. The molecular data is all "Level 3" data, and is divided according to platform/pipeline. See here for additional details regarding the TCGA data levels and data types.

Additional notebooks go into each of these tables in more detail, but here is an overview, in the same alphabetical order in which they are listed above and in the BigQuery web UI:

  • Annotations: This table contains the annotations that are also available from the interactive TCGA Annotations Manager. Annotations can be associated with any type of "item" (e.g., Patient, Sample, Aliquot, etc.), and a single item may have more than one annotation. Common annotations include "Item flagged DNU", "Item is noncanonical", and "Prior malignancy." More information about this table can be found in the TCGA Annotations notebook.
  • Biospecimen_data: This table contains information obtained from the "biospecimen" and "auxiliary" XML files in the TCGA Level-1 "bio" archives. Each row in this table represents a single "biospecimen" or "sample". Most participants in the TCGA project provided two samples: a "primary tumor" sample and a "blood normal" sample, but others provided normal-tissue, metastatic, or other types of samples. This table contains metadata about all of the samples, and more information about exploring this table and using this information to create your own custom analysis cohort can be found in the Creating TCGA cohorts (part 1) and (part 2) notebooks.
  • Clinical_data: This table contains information obtained from the "clinical" XML files in the TCGA Level-1 "bio" archives. Not all fields in the XML files are represented in this table, but any field that was found to be substantially filled in for at least one tumor type has been retained. More information about exploring this table and using this information to create your own custom analysis cohort can be found in the Creating TCGA cohorts (part 1) and (part 2) notebooks.
  • Copy_Number_segments: This table contains Level-3 copy-number segmentation results generated by The Broad Institute from Genome Wide SNP 6 data using the CBS (Circular Binary Segmentation) algorithm. The values are log2(copy number / 2), centered on 0. More information about this data table can be found in the Copy Number segments notebook.
  • DNA_Methylation_betas: This table contains Level-3 summary measures of DNA methylation for each interrogated locus (beta values: M/(M+U)). This table contains data from two different platforms: the Illumina Infinium HumanMethylation 27k and 450k arrays. More information about this data table can be found in the DNA Methylation notebook. Note that individual chromosome-specific DNA Methylation tables are also available to cut down on the amount of data that you may need to query (depending on your use case).
  • Protein_RPPA_data: This table contains the normalized Level-3 protein expression levels based on each antibody used to probe the sample. More information about how this data was generated by the RPPA Core Facility at MD Anderson can be found here, and more information about this data table can be found in the Protein expression notebook.
  • Somatic_Mutation_calls: This table contains annotated somatic mutation calls. All current MAF (Mutation Annotation Format) files were annotated using Oncotator v1.5.1.0, and merged into a single table. More information about this data table can be found in the Somatic Mutations notebook, including an example of how to use the Tute Genomics annotations database in BigQuery.
  • mRNA_BCGSC_HiSeq_RPKM: This table contains mRNAseq-based gene expression data produced by the BC Cancer Agency. (For details about a very similar table, take a look at a notebook describing the other mRNAseq gene expression table.)
  • miRNA_Expression: This table contains miRNAseq-based expression data for mature microRNAs produced by the BC Cancer Agency. More information about this data table can be found in the microRNA expression notebook.
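
Two of the descriptions above quote formulas: log2(copy number / 2) for the Copy_Number_segments values and M/(M+U) for the DNA methylation beta values. The sketch below simply evaluates those two formulas on made-up numbers to show how the values behave. The function names are hypothetical, and reading M and U as methylated and unmethylated signal intensities is an assumption based on the usual definition of a beta value, not something stated in this notebook.

```python
from __future__ import print_function, division
import math


def log2_copy_ratio(copy_number):
    """Return log2(copy_number / 2); a copy number of 2 maps to 0."""
    return math.log(copy_number / 2.0, 2)


def beta_value(methylated, unmethylated):
    """Return M / (M + U), a value between 0 and 1."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError('M + U must be non-zero')
    return methylated / float(total)


# Made-up example inputs, for illustration only.
for cn in (1, 2, 4):
    print('copy number %d -> %.2f' % (cn, log2_copy_ratio(cn)))

print('beta for M=3, U=1: %.2f' % beta_value(3, 1))
```

With this convention, a loss of one copy gives a negative value, two copies give 0, and a gain gives a positive value, which is what "centered on 0" refers to.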

Where to start?

We suggest that you start with the two "Creating TCGA cohorts" notebooks (part 1 and part 2), which describe and make use of the Clinical and Biospecimen tables. From there you can delve into the various molecular data tables as well as the Annotations table. For now, these sample notebooks are intentionally simple and do not include any analysis that integrates data from multiple tables. Once you have a grasp of how to use the data, however, developing your own more complex analyses should not be difficult. You could even contribute an example back to our GitHub repository! You are also welcome to submit bug reports, comments, and feature requests as GitHub issues.

A note about BigQuery tables and "tidy data"

You may be used to thinking about a molecular data table such as a gene-expression table as a matrix where the rows are genes and the columns are samples (or vice versa). These BigQuery tables instead use the tidy data approach, with each "cell" from the traditional data-matrix becoming a single row in the BigQuery table. A 10,000 gene x 500 sample matrix would therefore become a 5,000,000 row BigQuery table.
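
To make the matrix-to-tidy conversion concrete, here is a minimal sketch in plain Python using a tiny made-up matrix. The gene names, sample names, and field names (`gene`, `sample`, `value`) are hypothetical and do not correspond to the actual column names in the BigQuery tables.

```python
from __future__ import print_function


def matrix_to_tidy(gene_names, sample_names, matrix):
    """Turn a genes x samples matrix into one row per (gene, sample) cell."""
    rows = []
    for gene, values in zip(gene_names, matrix):
        for sample, value in zip(sample_names, values):
            rows.append({'gene': gene, 'sample': sample, 'value': value})
    return rows


# A made-up 3 gene x 2 sample matrix (rows are genes, columns are samples).
genes = ['GENE_A', 'GENE_B', 'GENE_C']
samples = ['sample_1', 'sample_2']
matrix = [
    [1.5, 2.0],
    [0.0, 3.25],
    [7.0, 0.5],
]

tidy = matrix_to_tidy(genes, samples, matrix)
print(len(tidy))  # 3 genes x 2 samples = 6 rows
for row in tidy:
    print(row['gene'], row['sample'], row['value'])
```

The row count of the tidy table is always the number of genes multiplied by the number of samples, which is how a 10,000 x 500 matrix becomes 5,000,000 rows.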